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Summary. In extensive form noncooperative game theory, at each instant t, each 
agent i sets its state Xi independently of the other agents, by sampling an associated 
distribution, qi(xi). The coupling between the agents arises in the joint evolution of 
those distributions. Distributed control problems can be cast the same way. In those 
problems the system designer sets aspects of the joint evolution of the distributions 
to try to optimize the goal for the overall system. Now information theory tells 
us what the separate qi of the agents are most likely to be if the system were to 
have a particular expected value of the objective function G(x i,X 2 , ■■■)■ So one can 
view the job of the system designer as speeding an iterative process. Each step of 
that process starts with a specified value of E(G), and the convergence of the qi 
to the most likely set of distributions consistent with that value. After this the 
target value for E q [G) is lowered, and then the process repeats. Previous work 
has elaborated many schemes for implementing this process when the underlying 
variables Xi all have a finite number of possible values and G does not extend to 
multiple instants in time. That work also is based on a fixed mapping from agents to 
control devices, so that the the statistical independence of the agents’ moves means 
independence of the device states. This paper also extends that work to relax all of 
these restrictions. This extends the applicability of that work to include continuous 
spaces and Reinforcement Learning. This paper also elaborates how some of that 
earlier work can be viewed as a first-principles justification of evolution-based search 
algorithms. 


1 Introduction 

This paper considers the problem of adaptive distributed control [18, 27, 23]. There 
are several equivalent ways to mathematically represent such problems. In this paper 
the representation of extensive form noncooperative game theory is adopted[13, 4, 
24, 3, 12] . In that representation, at each instant t each control agent i sets its state 
x \ independently of the other agents, by sampling an associated distribution, q\(x\). 
In this view the coupling between the agents does not arise directly, via statistical 
dependencies of the agents’ states at the same time t. Rather it arises indirectly, 
through the stochastic joint evolution of their distributions {ql} across time. 
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More formally, let time be discrete, where at the beginning of each t all control 
agents simultaneously and independently set their states (“make their moves”) by 
sampling their associated distributions. After they do so any remaining portions of 
the system (i.e., any stochastic part not being directly set by the control agents) 
responds to that joint move. Indicate the state of the entire system at time t as y t . 
(y t includes the joint move of the agents, x t , as well as the state at t of all stochastic 
elements not directly set by the agents.) So the joint distribution of the moves of 
the agents at any moment t is given by the product distribution q t {x t ) = JX- q\(x\), 
and the state of the entire system, given joint move x t , is governed by P(y t | a; 3 4 ). 

Now in general the observations by agent i of aspects of the system’s state at 
times previous to t will determine qj. In turn, those observations are determined by 
the previous states of the system. So qj is statistically dependent on the previous 
states of the entire system, y ^ <t ' t . Accordingly, the system can be viewed as a 
multi-stage noncooperative game among the agents and Nature. Each agent plays 
mixed strategies {qj} at moment t, and Nature’s move space at that time consists 
of those components of the vector y 4 not contained in x 4 [13, 4, 24, 3, 12]. The inter- 
dependence of the agents across time can be viewed as arising through information 
sets and the like, as usual in game theory. 

For pedagogical simplicity, consider the problem of inducing an optimal state 
y rather than the problem of inducing an optimal sequence of states. What the 
designer of the system can specify are the laws that govern how the joint mixed 
strategy q f gets updated from one stage of the game (i.e., one t) to the next. The 
goal is to specify such laws that will quickly lead to a good value of an overall 
objective function of the state of the system, F(y). 3 Note that the agents work in 
the space of a;’s; all aspects of the system not directly set by the agents, and in 
particular all noise processes, are implicitly contained in the distribution P(y \ x). 
Tautologically then, in distributed control the goal is to induce a joint strategy q(x) 
with a good associated value of E q {F) = f dxq{x)E{F \ x ). Defining the world 
utility G(x) = f dyF(y)P(y | *), we can re-express E q (F) purely in terms of x, as 
J dxq{x)G{x) = E q (G). 4 Once such a q is found, one can sample it to get a final x, 
and be assured that, on average, the associated F value is low. In other words, such 
sampling is likely to give us a good value of our objective. 

Previous work has has elaborated several iterative schemes for updating product 
distributions q to monotonically lower E q (G ) [28, 29, 30, 31, 22, 32, JJ. In all of 
these schemes, each q in the sequence is defined indirectly, as the minimizer of 
a different G-parameterized Lagrangian, 2*f(q). Implementing such a sequence of 
Lagrangian-minimizing q’s results in the optimal control policy for the distributed 
system, i.e., in the q minimizing E q (G). However while one cannot directly solve for 
the q minimizing E q (G) in a distributed manner, as elaborated below one can solve 
for the q minimizing each ^£{q) in a distributed manner. In this way one can find 
the optimal distributed control policy using a purely distributed algorithm. 

Many of these schemes are based on a steepest descent algorithm for each step 
of minimizing a Lagrangian JF{q). Because the descent is over Euclidean vectors q, 


3 Here we follow the convention that lower F is better. In addition, for simplicity 
we only consider objectives that depend on the state of the system at a single instant; 
it is straightforward to relax this restriction. 

4 For simplicity, here we indicate integrals of any sort, including point sums for 
countable x, with the f symbol. 
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these algorithms can be applied whether the Xi are categorical/symbolic, continuous, 
time-extended, or a mixture of the three. So in particular, they provide a principled 
way to do “gradient descent over categorical variables” . 

Not all previously considered algorithms for how to perform the Lagrangian- 
minimizing step are based on steepest descent. However they do have certain other 
characteristics in common. One is that the underlying variables x t all have a finite 
number of possible values. Another is G does not extend to multiple instants in 
time. This paper shows how to relax these restrictions, simply by redefining the 
spaces involved. This allows the previously considered algorithms to be used for 
continuous spaces, and also implement Reinforcement Learning (RL) [25, 16, 10, 14]. 
A final shared characteristic is that all of the previously considered algorithms for 
minimizing the Lagrangians employ a fixed mapping from the moves of agents to the 
setting of control devices, so that the statistical independence of the agents’ moves 
means independence of the device states. This paper also shows how that restriction 
can be relaxed, so that independent agents can result in coupled control devices. 

The general mathematical framework for casting control and optimization prob- 
lems in terms of minimizing Lagangians of probability distributions is known as 
“Probability Collectives”. The precise version where the probability distributions 
are product distributions is known as “Product Distribution” (PD) theory [28]. It 
has many deep connections to other fields, including bounded rational game theory 
and statistical physics [29]. As such it serves as a mathematical bridge connecting 
these disciplines. Some initial experimental results concerning the use of PD theory 
for distributed optimization and distributed control can be found in [2, 19, 1, 5, 7]. 
[21, 2, 19, 1, 5, 7], 

The next section reviews the salient aspects of PD theory. The section does 
not consider any of the schemes for the Lagrangian-minimizing step of adaptive 
distributed control in great detail; the interested reader is directed to the literature. 
However it is shown in that section how those schemes provide a first-principles 
justification of certain types of evolution-based search algorithms. 

The following section presents two ways to cast PD theory for uncountably 
infinite x as PD theory for countable X. This allows us to apply all the standard 
finite- A' algorithms even for uncountable X. Experimental tests validating one of 
those ways of recasting PD theory are presented in [6]. The following section shows 
how to recast single-instant PD theory to apply to the RL domain, in which y is 
time-extended. That section considers both episodic and discounted sum RL. The 
final section considers varying the mapping from the moves of agents to the setting 
of control devices. Experimental tests validating the usefulness of such variations 
are presented in [1[. 


2 Review of PD theory 

Say the designer stipulates a particular desired value of E(G), 7. For simplicity, 
consider the case where the designer makes no other claims concerning the sys- 
tem besides 7 and the fact that the joint strategy is a product distribution. Then 
information theory tells us that the a priori most likely q consistent with that infor- 
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mation is the one that maximizes entropy subject to that information [ 9 , 20 , 15]. 5 
In other words, of all distributions that agree with the designer’s information, that 
distribution is the “easiest” one to induce by random search. 

Given this, one can view the job of the designer of a distributed control system 
as an iterative equilibration process. In the first stage of each iteration the designer 
works to speed evolution of the joint strategy to the q with maximal entropy subject 
to a particular value of 7. Once we have found such a solution we can replace the 
constraint — replace the target value of E(G) — with a more difficult one, and then 
repeat the process, with another evolution of q [ 28 ]. 

To formalize this, define the maxent Lagrangian by 


JZ{q) = Jf y {q) = P(E q (G) — 7) — 5 

f= P(J dxq(x)G(x) -7) - S(q), (1) 

where S(q) is the Shannon entropy of q, — f dxq(x)lnj^l , and for simplicity we here 
take the prior fj, to be uniform. 6 . Given 7, the associated most likely joint strategy 
is the q that minimizes over all those (q, p) such that the Lagrange parameter 

P is at a critical point of «Sf 7 , i.e. , such that §7 = 0 . 

Solving, we find that the qi are related to each other via a set of coupled Boltz- 
mann equations (one for each agent i), 


—f3E p (G\xi) 

qf (xi) oc e 


( 2 ) 


where the overall proportionality constant for each i is set by normalization, the 
subscript q ^ on the expectation value indicates that it is evaluated according to 
the distribution Ylj=a Qi 1 an d P is set to enforce the condition E q p ( G ) = 7. Following 
Nash, we can use Brouwer’s fixed point theorem to establish that for any fixed p, 
there must exist at least one solution to this set of simultaneous equations. 

In light of the foregoing, one natural choice for an algorithm that lowers E q (G) 
is the repeated iteration of the following step: Start with the q 13 matching a current 
7 value, then lower 7 slightly, and end by modifying the old q 13 to find the one that 
matches the new 7. A difficulty with this iterative step is the need to solve for p 
as a function of 7. However we can use a trick to circumvent this need. Typically if 
we evaluate E(G) at the solutions q 13 , we find that it is a declining function of p. 
So in following the iterative procedure of equilibrating and then lowering 7 we will 
raise p. Accordingly, we can avoid the repeated matching of p to each successive 
constraint E(G) = 7, and simply monotonically increase P instead. This allows us 
to avoid ever explicitly specifying the values of 7 [ 31 ]. 

An alternative interpretation of this iterative scheme is based on prior knowledge 
of the value of the entropy rather of the expected G. Given this alternative prior 

Tn light of how limited the information is here, the algorithms presented below 
are best-suited to “off the shelf’ uses; incorporating more prior knowledge allows 
the algorithms to be tailored more tightly to a particular application. 

throughout this paper the terms in any Lagrangian that restrict distributions 
to the unit simplices are implicit. The other constraint needed for a Euclidean vector 
to be a valid probability distribution is that none of its components are negative. 
This will not need to be explicitly enforced in the Lagrangian here. 
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knowledge, we can recast the designer’s goal as finding the q that is consistent with 
that knowledge that has minimal E(G). This again leads to Eq.’s 1 and 2. Now 
raising (3 is cast as lowering the (never-specified) prior knowledge of the entropy 
value rather than the (never-specified) prior knowledge of E(G). 

Simulated annealing is an example of this approach, where rather than work 
directly with q, one works with random samples of it formed via the Metropolis 
random walk algorithm [17, 11, 8, 26]. There is no a priori reason to use such an 
inefficient means of manipulating q however. In [31] for example one works with q 
directly instead. This results in an algorithm that is not simply “probabilistic” in 
the sense that the updating of its variables is stochastic (as in simulated annealing). 
Rather the very entity being updated is a probability distribution. 

Another advantage of casting the problem directly in terms of the maxent La- 
grangian is that one can even avoid the need to explicitly stipulate an annealing 
schedule. In the usual way, first order methods can be used to find the saddle point 
of the Lagrangian, e.g., by performing steepest ascent of 2^ in the Lagrange param- 
eter (3 while performing a descent in q 7 . 

In many situations one should use a modification of the maxent Lagrangian. 
Whenever one has extra prior knowledge about the problem domain, that should be 
used to modify the use of entropy as (in statistics terminology) a regularizer. This 
leads to Bayesian formulations [31]. Similarly, if one has constraints {fi{x) = 0}, 
the Lagrangian has to be modified to accout for them. The most naive way of doing 
this is to simply cast the constraints as Lagrange penalty terms {E(fi) = 0} and 
add those terms to the Lagrangian, in the usual way [31, 7] 8 . 

2.1 How to find minima of the Lagrangian 

Consider the situation where each Xi can take on a finite number of possible values, 

| X, | , and we are interested in the unconstrained maxent Lagrangian. Say we are 
iteratively evolving q to minimize «£? for some fixed (3, and are currently at some 
point q in Q, the space of product distributions (i.e., in the Cartesian product of 
unit simplices). Using Lemma 1 of [31], we can evaluate the direction from q within 
Q that, to first order, will result in the largest drop in the value of J?(q)'. 

( 3 ) 

x- 

where Ui(j) = (3E(G \ Xi = j) + ln[gf(j)], and the symbol indicates that we do not 
mean the indicated partial derivative, formally speaking, but rather the indicated 
component of the lst-order descent vector 9 . 

' Formally, since the maxent Lagrangian is not convex, we have no guarantee that 
the duality gap is zero, and therefore no guarantee about saddlepoints. Nonetheless, 
just as in other domains, first order methods here seem to work well in practice. 

8 Note though that since the gradient of entropy is infinite at the border of the 
unit simplex, we are guaranteed that no component of q will ever exactly equal 0, 
which typically means that the constraints {fi(x) = 0} will never be satisfied with 
probability exactly 1. 

9 Formally speaking, the partial derivative is given by Ui(j). Intuitively, the reason 
for subtracting , u;(a:( )/|AL| is to keep the distribution in the set of all possible 
probability distributions over x, V. 
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Eq. 3 specifies the change that each agent should make to its distribution to 
have them jointly implement a step in steepest descent of the maxent Lagrangian. 
These updates are completely distributed, in the sense that each agent’s update at 
time t is independent of any other agents’ update at that time. Typically at any 
t each agent i knows qi(t) exactly, and therefore knows ln[?i (.?)]• However often it 
will not know G and/or the qyy In such cases it will not be able to evaluate the 
E(G | Xi = j) terms in Eq. 3 in closed form. 

One way to circumvent this problem is to have those expectation values be si- 
multaneously estimated by all agents by repeated Monte Carlo sampling of q to 
produce a set of ( x,G(x )) pairs. Those pairs can then be used by each agent i to 
estimate the values E(G \ Xi = j), and therefore how it should update its distri- 
bution. In the simplest version of this, an update to q only occurs once every 
time-steps. In this scheme only the samples (x,G(x)) formed within a block of 
successive time-steps are used at the end of that block by the agents to update their 
distributions (according to Eq. 3). 

There are numerous other schemes besides gradient descent for finding minima 
of the Lagrangian. One of these is a second order version of steepest descent, con- 
strained to operate over Q. This scheme, called “Nearest Newton” [31], starts by 
calculating the point p £ V that one should step to from the current distribution 
q , if one were to use Newton’s method to descend the Lagrangian. Now in general 
that p is not a product distribution. So we instead find q' , the q £ Q that is closest 
(as measured by Kullbach-Leibler distance) to that point p; the step actually taken 
is to q' . 

This step from the current q turns out to be indentical to the gradient descent 
step, just with an extra multiplicative factor of qi{xi = j) multiplying each associ- 
ated component of that gradient descent step: 

A( qi ( Xi = j)) oc qi{j)ui(j) - (4) 

X ■ 

where Ui is as defined just below Eq. 3, and the proportionality constant is the step 
size. 

In the continuum time limit, this step rule reduces to the replicator equation of 
evolutionary game theory, only with an entropic term added in [30]. (Intuitively, that 
entropic term ensures the evolution explores sufficiently.) This connection can be 
viewed as a first-principles justification for (particular versions of) evolution-based 
search algorithms, e.g., genetic algorithms. To be precise, say we have a biological 
population of many “genes”, each specifying a value x, and an associated “fitness 
function” G(x). Have the frequency of each gene in the population be updated via 
the replicator dynamics, as usual in evolutionary game theory. We can justify this 
evolution-based search algorithm as the /3 — > oo limit of Nearest Newton for the 
case of a single agent with moves x. By allowing (3 < oo, we can extend those 
evolution-based search algorithms in a principled manner. 

A final example of a Lagrangian descent scheme, which is analogous to block 
relaxation, is “Brouwer updating” [30]. In that kind of updating one or more agents 
simultaneously jump to their optimal distribution, as given by Eq. 2 (with (3 rather 
than 7 specified, as discussed above). It turns out that if the expectations defining 
the Brouwer updating are “exponentially aged” to reflect nonstationarity, then in 
the continuum time limit Brouwer updating becomes identical to Nearest Newton. 



Distributed Control 


7 


The aging constant in Brouwer updating turns out to be identical to the step size 
in Nearest Newton. 

All of the update schemes can be used so long as each agent % knows or can 
estimate qt together with E q{i) (G \ Xi) = E(G \ Xi) for all of its moves Xi. No other 
quantities are involved. 


3 Semicoordinate transformations 

3.1 Motivation 

Consider a multi-stage game like chess, with the stages (i.e., the instants at which 
one of the players makes a move) delineated by t. In game theoretic terms, the 
“strategy” of a player is the board-configuration — > response rule it adopts before 
play starts [13, 4, 24, 3, 12]. More generally, in a multi-stage game like chess the 
strategy of player i, Xi, is the set of t-indexed maps taking what that player has 
observed in the stages t' < t into its move at stage t. Formally, this set of maps is 
called player i’s normal form strategy. 

The joint strategy of the two players in chess sets their joint move-sequence, 
though in general the reverse need not be true. In addition, one can always find a 
joint strategy to result in any particular joint move-sequence. Now typically at any 
stage there is overlap in what the players have observed over the preceding stages. 
This means that even if the players’ strategies are statistically independent (being 
separately set before play started), their move sequences are statistically coupled. In 
such a situation, by parameterizing the space Z of joint-move-sequences a with joint- 
strategies x, we shift our focus from the coupled distribution P(z) to the decoupled 
product distribution, q(x). This is the advantage of casting multi-stage games in 
terms of normal form strategies. 

More generally, any onto mapping (:*-*?, not necessarily invertible, is called 
a semicoordinate system. The identity mapping 2 — > z is a trivial example of a 
semicoordinate system. Another example is the mapping from joint-strategies in a 
multi-stage game to joint move-sequences is an example of a semicoordinate system. 
In other words, changing the representation space of a multi-stage game from move- 
sequences 2 to strategies a: is a semicoordinate transformation of that game. 

Intuitively, a semi-coordinate transformation is a reparameterization of how a 
game — a mapping from joint moves to associated payoffs — is represented. So 
we can perform a semicoordinate transformation even in a single-stage game. Say 
we restrict attention to distributions over X that are product distributions. Then 
changing £(.) from the identity map to some other function means that the players’ 
moves are no longer independent. After the transformation their move choices — 
the components of 2 — are statistically coupled, even though we are considering a 
product distribution. 

Formally, this is expressed via the standard rule for transforming probabilities, 

Pz(z €Z) = C (Px) = j dxP x (x)S(z - C(a:)), (5) 

where Px and Pz are the distributions across A' and Z, respectively. To see what 
this rule means geometrically, let V be the space of all distributions (product or 
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otherwise) over Z. Recall that Q is the space of all product distributions over X, 
and let ((Q) be the image of Q in V. Then by changing £(.), we change that image; 
different choices of </(.) will result in different manifolds C(2)- 

As an example, say we have two players, with two possible moves each. So z 
consists of the possible joint moves, labeled (1, 1), (1, 2), (2, 1) and (2, 2). Have X = 
Z, and choose C(l, 1) = (1, 1), C(l, 2) = (2, 2), C(2, 1) = (2, 1), and C(2, 2) = (1, 2). 
Say that q is given by q\(x\ = 1) = q 2 {x 2 = 1) = 2/3. Then the distribution over 
joint-moves z is P z ( 1,1) = -Pxr(l, 1) = 4/9, P z { 2,1) = P z { 2,2) = 2/9, Pz(l,2) = 
1/9. So P z (z) ^ Pz(zi)P z (z 2 )', the moves of the players are statistically coupled, 
even though their strategies Xi are independent. 

Such coupling of the players’ moves can be viewed as a manifestation of sets 
of potential binding contracts. To illustrate this return to our two player example. 
Each possible value of a component Xi determines a pair of possible joint moves. 
For example, setting x\ = 1 means the possible joint moves are (1,1) and (2,2). 
Accordingly such a value of x t can be viewed as a set of proffered binding contracts. 
The value of the other components of x determines which contract is accepted; it is 
the intersection of the proffered contracts offered by all the components of x that 
determines what single contract is selected. Continuing with our example, given that 
xi = 1, whether the joint-move is (1, 1) or (2,2) (the two options offered by xi) is 
determined by the value of X2- 

To relate semicoordintes to distributed control we have to fix some notation. 
To maintain consistency with the discussion of maxent Lagrangians, we will have 
product distributions q(x £ X) £ Qx ■ Also as before, to allow stochasticity, we 
write the ultimate space of interest as y with associated cost function F(y). This 
means that x sets z which stochastically sets y: 

Eq{F) = J dy P(y)F(y) = J dz P(z)G{z) = J dx q(x)G(x) (6) 


where 

G{z) = J dy F(y)P(y \ z) (7) 

= J dx G(x)5(£(x) — z); 

G(x) = j dy F(y)P(y \ x) (8) 

= J d V F{y)P(y \ ((x)). 

3.2 Representational properties 

Binding contracts are a central component of cooperative game theory. In this sense, 
semicoordinate transformations can be viewed as a way to convert noncooperative 
game theory into a form of cooperative game theory. Indeed, any cooperative mixed 
strategy can be cast as a non-cooperative game mixed strategy followed by an ap- 
propriate semicoordinate transformation. Formally, any P z , no matter what the 
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coupling among its components, can be expressed as £(Px) for some product distri- 
bution Px for and associated £(.) 10 

Less trivially, given any model class of distributions { Pz }, there is an X and 
associated £(.) such that {Pz} is identical to C (Qx). Formally this is expressed in 
a result concerning Bayes nets. For simplicity, restrict attention to finite Z. Order 
the components of Z from 1 to N . For each index i £ {1,2,..., N}, have the parent 
function V(i,z) fix a subset of the components of z with index greater than i, 
returning the value of those components for the z in its second argument if that 
subset of components is non-empty. So for example, with N > 5, we could have 
V(l,z) = (Z 2 ,Z 5 ). Another possibility is that V(l,z) is the empty set, independent 
of z. 

Let A(V) be the set of all probability distributions Pz that obey the conditional 
dependencies implied by V: V Pz £ A(V),z £ Z, 

N 

Pz(z) = HPz(z i \V(i,z)). (9) 

i— 1 

(By definition, if P(i,z)) is empty, Pz(zi \ is just the i’th marginal of Pz, 

Pz(zi).) Note that any distribution Pz is a member of A{ V) for some V — in the 
worst case, just choose the exhaustive parent function V(i, z) = {zj : j > i}. 

For any choice of V there is an associated set of distributions C(Qa') that equals 
A(V) exactly: 

Theorem 1: Define the components of X using multiple indices: For all i £ 
{1, 2, . . . , N} and possible associated values (as one varies over z £ Z) of the vector 
V(i,z), there is a separate component of x, Xi ; -p(i,z)- This component can take on 
any of the values that z% can. Define £(.) recursively, starting at * = N and working 
to lower i, by the following rule: V i £ {1, 2, ... , N}, 

[C(*)]i = x i;V(i,z) • 


Then A(V) = C (Qx). 

Proof: First note that by definition of parent functions, due to the fact that we’re 
iteratively working down from higher i’s to lower ones, (}{x ) is properly defined. Next 
plug that definition into Eq. 5. For any particular x and associated z = £(*), those 
components of x that do not “match” z by having their second index equal V(i,z) 
get integrated out. After this the integral reduces to 

N 

Pz (z) ^ ^ Px ([■*h;'P(i,z)] Z , ) , 

i = 1 

i.e. , is exactly of the form stipulated in Eq. 9. Accordingly, for any fixed x and 
associated z = C(*)> ranging over the set of all values between 0 and 1 for each 
of the distributions Px{[xi ; v(i,z) = Zi) will result in ranging over all values for the 
distribution Pz{z) that are of the form stipulated in Eq. 9. This must be true for 

10 In the worst case, one can simply choose A' to have a single component, with £(.) 
a bijection between that component and the vector z — trivially, any distribution 
over such an X is a product distribution. 
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all x. Accordingly, ((Qx) Q A(V). The proof that A( V) C ((Qx) goes similarly: 
For any given Pz and z, simply set Px{[xi-,v(i,z)\ = zf) for all the independent 
components of x and evaluate the integral in Eq. 5. QED. 

Intuitively, each component of x in Thm. 1 is the conditional distribution Pz(zt \ 
V(i,z)) for some particular instance of the vector V(i, z)). Thm. 1 means that in 
principle we never need consider coupled distributions. It suffices to restrict attention 
to product distributions, so long as we use an appropriate semicoordinate system. 
In particular, mixture models over Z can be represented this way. 


3.3 Maxent Lagrangians over X rather than Z 

While the distribution over A' uniquely sets the distribution over Z , the reverse 
is not true. However so long as our Lagrangian directly concerns the distribution 
over X rather than the distribution over Z , by minimizing that Lagrangian we 
set a distribution over Z. In this way we can minimize a Lagrangian involving 
product distributions, even though the associated distribution in the ultimate space 
of interest is not a product distribution. 

The Lagrangian we choose over X should depend on our prior information, 
as usual. If we want that Lagrangian to include an expected value over Z (e.g., 
of a cost function), we can directly incorporate that expectation value into the 
Lagrangian over X, since expected values in A' and Z are identical: f dzPz{z)A(z) = 
f dxPx(x)A(((x)) for any function A(z). (Indeed, this is the standard justification 
of the rule for transforming probabilities, Eq. 5.) 

However other functionals of probability distributions can differ between the 
two spaces. This is especially common when £(.) is not invertible, so X is larger 
than Z. In particular, while the expected cost term is the same in the A' and Z 
maxent Lagrangians, this is not true of the two entropy terms in general; typically 
the entropy of a q £ Q will differ from that of its image, £( q ) £ C(2) i 11 such a case. 

More concretely, the fully formal definition of entropy includes a prior probability 
fj,: Sx = f dxp(x) ln(^y), and similarly for Sz- So long as p(x) and fj,(z) are related 
by the normal laws for probability transformations, as are p(x) and p(z), then if the 
cardinalities of X and Z are the same, Sz = Sx 11 • When the cardinalities of the 
spaces differ though (e.g., when X and Z are both finite but with differing numbers 
of elements), this need no longer be the case. The following result bounds how much 
the entropies can differ in such a situation: 

Theorem 2: For all z £ Z, take n(x) to be uniform over all x such that £(*) = z. 
Then for any distribution p(x) and its image p(z), 

— f dz p(z) \n(K (z)) < Sx — Sz < 0, 


11 For example, if X = Z = R, then ln[^$J = = M^y], where 

J<;(x) is the determinant of the Jacobian of £(.) evaluated at x. Accordingly, as far as 
transforming from X to Z is concerned, entropy is just a conventional expectation 
value, and therefore has the same value whichever of the two spaces it is evaluated 
in. 
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where K(z) = f dx8(z — £(x)). (Note that for finite A' and Z, K(z) > 1, and counts 
the number of x with the same image z.) If we ignore the p terms in the definition 
of entropy, then instead we have 


Proof: Write 


0 < Sx — Sz < — J dz p(z) \n(K (z)) . 

Sx = — J dz J dx 5(z — ((x)) p(x) ln[^j— |] 
= — J dz J dx 5(z — ((x)) p(x) x 

= — J dz p(z)ln[d(z)\ — 
Jdzjdx6(z-«x))p(x) 


where d z = f dx 5(z — CM) Define to be the common value of all p(x) 

such that CM = So p,(z) = p z K(z) and p(z) = p z d(z). Accordingly, expand our 
expression as 


P(z) 
p{z) 

JdzJdx5(z-«x))p( x )H 

= Sz — J dz p(z)K(z) + 

J dz p(z) (- J dx 5{z - CM) pjfy Mpjfj])- 


/ 


] - / dz p(z)K(z) - 


Sx = — 


J dz p(z) 


M 


The ^-integral of the right-hand side of the last equation is just the entropy of 
normalized the distribution defined over those x such that CM = Its maxi- 
mum and minimum are ln[A'(z)] and 0, respectively. This proves the first claim. The 
second claim, where we “ignore the p terms”, is proven similarly. QED. 


In such cases where the cardinalities of X and Z differ, we have to be careful 
about which space we use to formulate our Lagrangian. If we use the transformation 
C(.) as a tool to allow us to analyze bargaining games with binding contracts, then the 
direct space of interest is actually the x’s (that is the place in which the players make 
their bargaining moves). In such cases it makes sense to apply all the analysis of the 
preceding sections exactly as it is written, concerning Lagrangians and distributions 
over x rather than 2 (so long as we redefine cost functions to implicitly pre-apply 
the mapping £(.) to their arguments). However if we instead use £(•) simply as a 
way of establishing statistical dependencies among the moves of the players, it may 
make sense to include the entropy correction factor in our x-space Lagrangian. 

An important special case is where the following three conditions are met: Each 
point 2 is the image under ((.) of the same number of points in x-space, n; p(x) 
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is uniform (and therefore so is p(z))\ and the Lagrangian in *-space, P£ x , is a sum 
of expected costs and the entropy. In this situation, consider a 2 -space Lagrangian, 
2z? z , whose functional dependence on P z , the distribution over z’s, is identical to the 
dependence of 2z? x on P x , except that the entropy term is divided by n 12 . Now the 
minimizer P*(x ) of P£ x is a Boltzmann distribution in values of the cost function(s). 
Accordingly, for any 2 , P*(x) is uniform across all n points x £ (~ 1 (z) (all such x 
have the same cost value(s)). This in turn means that S(((P X )) = nS(P z ). So our 
two Lagrangians give the same solution, i.e., the “correction factor” for the entropy 
term is just multiplication by n. 

3.4 Exploiting semicoordinate transformations 

This subsection illustrates some way to exploit semicoordinate transformations to 
facilitate descent of the Lagrangian. To illustrate the generality of the arguments, 
situations where one has to to use Monte Carlo estimates of conditional expectation 
values to descend the shared Lagrangian (rather than evaluate them closed-form) 
will be considered. 

Say we are currently at a local minimum q £ Q of 2z? . Usually we can break out 
of that minimum by raising fi and then resuming the updating; typically changing fi 
changes 2*f so that the Lagrange gaps are nonzero. So if we want to anneal fi anyway 
(e.g., to find a minimum of the shared cost function G), it makes sense to do so to 
break out of any local minima. 

There are many other ways to break out of local minima without changing the 
Lagrangian (as we would if we changed fi , for example) [31]. Here we show how 
to use semicoordinate transformations to do this. As explicated below, they also 
provide a general way to lower the value of the Lagrangian, whether or not one has 
local minimum problems. 

Say our original semicoordinate system is C 1 ]-)- Switch to a different semico- 
ordinate system <f 2 (.) for Z and consider product distributions over the associated 
space A' 2 . Geometrically, the semicoordinate transformation means we change to a 
new submanifold C, 2 (Q) C V without changing the underlying mapping from p(z) 
to &z(p). 

As a simple example, say (f 2 is identical to (j 1 except that it joins two components 
of x into an aggregate semicoordinate. Since after that change we can have statistical 
dependencies between those two components, the product distributions over A' 2 , 
<f 2 (Qx 2 ), map to a superset of C 1 (2x 1 )- Typically the local minima of that superset 
do not coincide with local minima of C 1 (2.\' 1 )- So this change to A' 2 will indeed 
break out of the local minimum, in general. 

More care is needed when working with more complicated semicoordinate trans- 
formations. Say before the transformation we are at a point p* £ (f {Qx 1 )- Then in 
general p* will not be in the new manifold ( 2 (Qx 2 ), he., p* will not correspond to a 
product distribution in our new semicoordinate system. (This reflects the fact that 
semicoordinate transformations couple the players.) Accordingly, we must change 
from p* to a new distribution when we change the semicoordinate system. 

To illustrate this, say that the semicoordinate transformation is bijective. For- 
mally, this means that A 2 = A 1 = A and C, 2 (x) = f 1 (£(*)) f° r a bijective £(.). 

12 For example, if Jf x (P x ) = pE Px (G{t(.)))-S(P x ), then jgf f(P z ) = pE Pz (G{.))- 
S(P z )/n, where P x and P z are related as in Eq. 5. 
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Have £(.), the mapping from X 2 to X 1 , be the identity map for all but a few of 
the M total components of X, indicated as indices 1 — > n. Intuitively, for any fixed 
= Xn+i^M, the effect of the semicoordinate transformation to £ 2 (.) from 
£ x (.) is merely to “shuffle” the associated mapping taking semicoordinates 1 — > n to 
Z, as specified by £(.). Moreover, since £(.) is a bijection, the maxent Lagrangians 
over A' 1 and X 2 are identical: -£oc 1 (£(j )A )) = -5?x 2 ((p X ))• 

y-2 y 1 

Now say we set q' n+1 ^ M = This means we can estimate the ex- 

pectations of G conditioned on possible x 2 ^ n from the Monte Carlo samples 
conditioned on £(x 2 ^ n ). In particular, for any £(.) we can estimate E{G ) as 
J dx\^ n p x (®i _^ n )E(G | £(®i,. ..,„)) in the usual way. Now entropy is the sum of 
the entropy of semicoordinates n + 1 — > M plus that of semicoordinates 1 — > n. So 
for any choice of £(.) and q x _> n , we can approximate 2 £x = -S?x 2 as ( our associated 
estimate of) E{G) minus the entropy of p X -^ n , minus a constant unaffected by choice 
of £(.). 

So for finite and small enough cardinality of the subspace |A’i_, n |, we can use our 
estimates E(G | £(®i^„)) to search for the “shuffling” £(.) and distribution q x ^ n 
that minimizes 5£ x 13 . In particular, say we have descended I£x to a distribution 
q x ( x ) = q*(x). Then we can set q x = q* , and consider a set of of “shuffling 
£(.)”. Each such £(.) will result in a different distribution q x ( x ) = q x (C _1 (*)) = 
g*(^ _1 (x)). While those distributions will have the same entropy, typically they will 
have different (estimates of) E(G) and accordingly different local minima of the 
Lagrangian. 

Accordingly, searching across the £(.) can be used to break out of a local min- 
imum. However since E(G) changes under such transformations even if we are not 
at a local minimum, we can instead search across £(.) as a new way (in addition 
to those discussed above) for lowering the value of the Lagrangian. Indeed, there 
is always a bijective semicoordinate transformation that reduces the Lagrangian: 
simply choose £(.) to rearrange the G(x) so that G(x) < G(x') <t=> q(x) < q(x'). In 
addition one can search for that £(.) in a distributed fashion, where one after the 
other each agent i rearranges its semicoordinate to shrink E(G). Furthermore to 
search over semicoordinate systems we don’t need to take any additional samples of 
G. (The existing samples can be used to estimate the E(G) for each new system.) 
So the search can be done off-line. 

To determine the semicoordinate transformation we can consider other factors 
besides the change in the value of the Lagrangian that immediately arises under the 
transformation. We can also estimate the amount that subsequent evolution under 
the new semicoordinate system will decrease the Lagrangian. We can estimate that 
subsequent drop in a number of ways: the sum of the Lagrangian gaps of all the 
agents, gradient of the Lagrangian in the new semicoordinate system, etc. 

3.5 Distributions over semicoordinate systems 

The straightforward way to implement these kinds of schemes for finding a good 
semicoordinate systems is via exhaustive search, hill-climbing, simulated annealing, 


13 penalizing by the bias 2 plus variance expression if we intend to do more Monte 
Carlo — see [28]. 
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or the like. Potentially it would be very useful to instead find a new semicoordinate 
system using search techniques designed for continuous spaces. When there are a 
finite number of semicoordinate systems (i.e., finite X and Z ) this would amount 
to using search techniques for continuous space to optimize a function of a variable 
having a finite number of values. However we now know how to do that: use PD 
theory. In the current context, this means placing a product probability distribution 
over a set of variables parameterizing the semicoordinate system, and then evolving 
the probability distribution. 

More concretely, write 


N 

^)^EE f, ( 9 )II«'W G K( i ' fl )) + s (9) ( 10 ) 

0 x i= 1 

N 

q i (x i )P(e)G(ax,e)) + S(q) 

0 x i= 1 

where 8 is a parameter on the semicoordinate system. We can rewrite this using an 
additional semicoordinate transformation, as 

N+l 

W) =/ 3 ^n Vi(x*i)G( <(**)) + s(q*) (11) 

x* i= 1 

where x* = x t for all i up to N, and x* N+1 = 8. (As usual, depending on what space 
we cast our Lagrangian in, the entropy can either have the argument of the entropy 
term starred — as here — or not.) 

Intuitively, this approach amounts to introducing a new coordinate/agent, whose 
“job” is to set the semicoordinate system governing the mapping from the other 
agents to a 2 value. This provides an alternative to periodically (e.g., at a local 
minimum) picking a set of alternative semicoordinate systems and estimating which 
gives the biggest drop in the overall Lagrangian. We can instead use Nearest New- 
ton, Brouwer updating, or what have you, to continuously search for the optimal 
coordinate system as we also search for the optimal x. The tradeoff, of course, is that 
by introducing an extra coordinate/agent, we raise the noise level all the original 
semicoordinates experience. (This raises the issue of what best parameterization of 
£(.) to use, an issue not addressed here.) 


4 PD theory for uncountable Z 

In almost all computational algorithms for finding minima, and in particular in the 
algorithms considered above, we can only modify a finite set of real numbers from one 
step to the next. When Z is finite, we accomodate this by having the real numbers 
be the values of the components of the qi. But how can we use a computational 
algorithm to find a minimum of the maxent Lagrangian when Z is uncountable? 

One obvious idea is to have the real numbers our algorithm works with parame- 
terize p differently from how they do with product distributions. For example, rather 
than product distributions, we could use distributions that are mixture models. In 
that case the real numbers are the parameters of the mixture model; our algorithm 
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would minimize the value of the Lagrangian over the values of the parameters of the 
mixture model. 

An alternative set of approaches still use product distributions, with all of its 
advantages, but employs a special type of semicoordinate system for Z . For peda- 
gogical simplicity, say that Z is the reals between 0 and 1. So £ must be a semi- 
coordinate system for the reals, i.e. , each must map to a single z£(. Now we 

want to have those of the qi that we’re modifying be probability distributions, not 
probability density functions (pdf’s), so that our computational algorithm can work 
with them. Accordingly, in our minimization of the Lagrangian we do not directly 
modify coordinates that can take on an uncountable number of values (generically 
indicated with superscript 2), but only coordinates that take on a finite number of 
values (generically indicated with superscript 1). 

We illustrate this for the minimization schemes considered in the preceding sec- 
tions. For generality, we consider the case where Monte Carlo sampling must be 
used to estimate the values of E(G \ x 1 ) arising in those schemes. Accordingly, we 
need two things. The first is a way to sample q to get an z, which then provides a 
G value. The second is a way to estimate the quantities E(G \ x 1 ) based upon such 
empirical data. Given those, all the usual algorithms for searching q 1 to minimize 
the Lagrangian hold; intuitively, we treat the q 2 like stochastic processes that reside 
in Z but not A', and therefore not directly controllable by us. 


4.1 Reimann semicoordinates 


In the Reimann semicoordinate system, x 1 can take values 0, 1, ..., B — 1, and x 2 
is the reals between 0 and 1. Then with oti = i/B, we have 

z = a x i + x 2 IB (12) 

O.j.1 T X (tt x l_j_;L OL x l). 

We then fix q 2 {x 2 ) to be uniform. So all our minimization scheme can modify are 
the B values of q (ar). 

To sample q, we simply sample q 1 to get a value of x 1 and q 2 to get a value of 
x 2 . Plugging those two values into Eq. 13 gives us a value of z. We then evaluate 
the associated value of the world utility; this provides a single step in our Monte 
Carlo sampling process. 

Next we need a way to use a set of such Monte Carlo sample points to estimate 
E(G | x 1 ) for all x 1 . We could do this with simple histogram averaging, using 
Laplace’s law of succession to deal with bins (x 1 values) that aren’t in the data. 
Typically though with continuous Z we expect F(z) to be smooth. In such cases, 
it makes sense to allow data from more than one bin to be combined to estimate 
E{G | x 1 ) for each x 1 , by using a regression scheme. 

For example, we could use the weighted average regression 


F(z) 


E, Fie -^- Zi)2/2 ° 2 

’^2 i e~G~ z i) 2 / 2<t 2 ’ 


(13) 


where a is a free parameter, Zi is the i’th value of z out of our Monte Carlo samples, 
and Fi is the associated i’th value of F. Given such a fit, we would then estimate 
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E(G | x 1 ) = J dx 2 q 2 (x 2 )F(((x 1 , x 2 )) 

« J dx 2 q 2 (x 2 )F(C(x 1 ,x 2 )). (14) 

This integral can then be evaluated numerically. 

Typically in practice one would use a trapezoidal semicoordinate system, rather 
than the rectangular illustrated here. Doing that introduces linear terms in the 
integrals, but those can still be evaluated as before. 


4.2 Lebesgue semicoordinates 


The Lebesgue semicoordinate system generalizes the Rcimann system, by param- 
eterizing it. It does this by defining a set of increasing values {ao, 02 , •••, Ob} that 
all lie between 0 and 1 such that Qo = 0 and as = 1. We then write 

z = a x 1 +x 2 {a x i +1 -a x i). (15) 

Sampling with this scheme is done in the obvious way. The expected value of G 
if q 2 is uniform (i.e., all x 2 are equally probable) is 


E{G) ^TM*) J dx 2 q 2 (x 2 )F[a x i +x 2 (a x i +1 - a x i)] 

l J OL 1 + l 


(16) 


and similarly for E(G \ x 1 ). When the at are evenly spaced, the Lebesgue system 
just reduces to the Rcimann system, of course. 

Note that for a given value of x 1 , we have probability mass 1 in the bin follow- 
ing a x i. So g 1 (x 1 ) sets the cumulative probability mass in that bin. Changing the 
parameters ca will change what portion of the real line we assign to that mass — 
but it won’t change the mass. 

This may directly affect the Lagragian we use, depending on whether it’s the X- 
space Lagrangian or the Z-space one. In the Reimann semicoordinate system, Sx oc 
Sz, and both Lagrangians are the same (just with a rescaled Lagrange parameter). 
However in the Lebesgue system, if the a ( are not evenly spaced, those two entropies 
are not proportional to one another. Accordingly, in that scenario, one has to make 
a reasoned decision of which maxent Lagrangian to use. 

The {cti} are a finite set of real numbers, just like q 1 . Accordingly, we can 
incorporate them along with q 1 into the argument of the maxent Lagrangian, and 
search for the Lagrangian-minimizing set {at} and q 1 together 14 . I 11 fact, one can 
even have q 1 fixed, along with q 2 , and only vary the {at}. The difference between 
such a search over the {ai} when q 1 is fixed, and a search over q 1 when the {a;} 
are fixed, is directly analogous to the difference between Reimann and Lebesgue 
integration, in how the underlying distribution P(z) is being represented. 


14 Compare this to the scheme discussed previously for searching directly over 
semicoordinate transformations, where here the search is over probability distribu- 
tions defined on the set of possible semicoordinate transformations. 
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Whether or not q 1 is also varied, one must be careful in how one does the search 
for each a:;. Unlike for each {(/;}, each on does not arise as a multilinear product, and 
therefore appears more than once in the Lagrangian. For example, any particular 
a x i term arises in Eq. 16 twice as a limit of an integral, and twice in an integrand. 
All four instances must be accounted for in differentiating the E(G) term in the 
Lagrangian with respect to that a x i term. 


4.3 Decimal Reimann semicoordinates 

In the standard Reimann semicoordinate system, we use only one agent to decide 
which bin x 1 falls into. To have good precision in making that decision, there must 
be many such bins. This often means that there are few Monte Carlo samples in 
most bins. This is why we need to employ a regression scheme (with its attendant 
implicit smoothness assumptions) to estimate E(G \ Xi ). 

An alternative is to break x 1 into a set of many agents, through a hierarchical 
decimal-type representation. For example, say x 1 can take on 2 K values. Then under 
a binary representation, we would specify the bin by 

K 

x'-X' 1 '' 2 ' (17) 

i= 1 

where xj is the bit specifying agent %' s value. With this change updating the La- 
grangian is done by K agents, with each agent i estimating E(G \ xj) for two values 
of xj, rather than by a single agent estimating E{G \ x 1 ) for all 2 K values of x 1 . 

With this system, each agent performs its estimations by looking at those Monte 
Carlo samples where z fell within one particular subregion covering half of [0.0, 1.0]. 
So long as the samples weren’t generated from too peaked a distribution (e.g., early 
in the search process), there will typically be many such samples, no matter what 
bit i and associated bit value x\ we are considering. Accordingly, we do not need 
to perform a regression to estimate E(G \ xj) to run our Lagrangian minimization 
algorithms 15 . When q is peaked, some of bin counts from the Monte Carlo data 
may be small. We can use regression as above, if desired, for such impoverished bins. 
Alternatively, we can employ a Lebesgue-type scheme to update the bin borders, to 
ensure that all xj occur often in the Monte Carlo data. 


5 PD theory for Reinforcement Learning 

In this section we show how to use semicoordinate transformations and PD the- 
ory for a single RL algorithm playing against nature in a time-extended game with 
delayed payoffs. The underlying idea is to “fracture” the single learner across mul- 
tiple timesteps into a set of separate agents, one for each timestep. This gives us a 
distributed system. Constraints are then used to couple those agents. It is straight- 
forward to extend this approach to the case of a multi-agent system playing against 
nature in a time-extended game. 

15 As usual, we could have the entropy term in the Lagrangian be based on either 
X space or Z space. 



18 


Distributed Control 


5.1 Episodic RL 

First consider episodic RL, in which reward comes only at the end of an episode of 
T timesteps. The learner chooses actions in response to observations of the state of 
the system via a policy. It does this across a set of several episodes, modifying its 
policy as it goes to try to maximize reward. The goal is to have it find the optimal 
such policy as quickly as possible. 

To make this concrete, use superscripts to indicate timestep in an episode. So 
z = (z 1 , z 2 , z 3 , . . . z T ) = C( x )- If we assume the dynamics is Markovian, P(z) = 
P(z 1 )P(z 2 | z 1 )P(z 3 | z 2 ) . . . P(z T | z T ~ 1 ). Typically the objective function G 
depends solely on z T . For the conventional RL scenario, each z t can be expressed 
as (s*, a*), where s 4 is the state of the system at t, and a t is the action taken then. 

As an example, say the learner doesn’t take into account its previous actions 
when deciding its current one and that it observes the state of the system (at the 
previous instant) with zero error. Then 

P(z t | z ‘- 1 ) = P{s t ,a t | s t_1 ,o t_1 ) = P{a | s‘ _1 )P(s t | s t_1 ,a t_1 ). (18) 

Have £(.) give us a representation of each of the conditional distributions in the 
usual way using semicoordinates (see Thm. 1). So A' is far larger than Z, and we 
can write P(z) with abbreviated notation as 

P(s°,a °, . . . , s T , a T ) = P(a°)P(s°) P(a \ s t ~ 1 )P(s t | s t_1 ,a t_1 ) 

t> 1 

= <lAo(a 0 )q s o(s 0 ) (! 9 ) 

t> i 

In RL we typically can only control the q t s t-i distributions. While the other qi go 
into the Lagrangian, they are fixed and not directly varied. 

If it is desired to have the policy of the learner be constant across each episode, 
we can add penalty terms Ai[9t,s(a) — <?t+i, s (a)] Vt, s,a to the A'-space Lagrangian 
to enforce time-translation invariance in the usual way 1617 . Time-translation in- 
variance of the P(s t | s t_1 ,a t_1 ) does not explicitly need to be addressed. Indeed, 
it need not even hold. 

Up to an overall additive constant, the resulting A-space Lagrangian is 
}) = P q A° («°)feo (s° ) II 4 m*- 1 («*)&, (s )G(s) 

a,s £>1 

- S(q s o) - y~]S(gt, s ) + Ks,a[<ltA a ) - 1t-l,s( a )] 

t>l t>l,s,a 

(20) 


16 Note that unlike constraints over X, those over Q are not generically true only 
to some high probability, but rather can typically hold with probability 1. 

17 If such constancy is a hard and fast requirement, rather than just desirable, 
then the simplest aproach is simply to have a single agent with a distribution q s (a) 
that sets qt,s(a) Vi. 
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where s and a indicate the vectors of all s* and all a* , respectively, and the entropy 
function S(.) should not be confused with the subscript s° on q (which indicates the 
component of q referring to the time-0 value of the state variable) 18 

We can then use any of the standard techniques for descending this Lagrangian. 
So for example say we use Nearest Newton. Then at the end of each episode, for 
each t > 1 ,s,a, we should increase qt,s(a) by 

a \q t ,s{a)(j3E(G | = s, a = a) + ln(( 7 M (a))+ Xt,s, a -Xt+i,s,a^ - constj , (21) 

where as usual a is the step size and const is the normalization constant (see Eq. 4). 

5.2 Discounted sum RL 

It is worth giving a brief overview of how the foregoing gets modified when we 
instead have a single “episode” of infinite time, with rewards received at every t, 
and the goal of the learner at any instant being to optimize the discounted sum of 
future rewards. 

Let the matrix P be the conditional distribution of state Zt given state Zt-i, and 
7 a real- valued discounting factor between 0 and 1. Write the single-instant reward 
function as a vector R whose components give the value for the various Zt . Then if 
Po is the current distribution of (single-instant) states, z o, the world utility is 


([^( 7 P) t ]Po) • R 

t= 1 

The sum is just a geometric series, and equals x 2. p , where 1 is the identity 
matrix, and it doesn’t matter if the matrix inversion giving the denominator term 
is right-multiplied or left-multiplied by the numerator term. 

We’re interested in the partial derivative of this with respect to one of the entries 
of P (those entries are given by the various qi,j)- What we know though (from our 
historical data) is a (non-IID) set of samples of (yP^P-R for various values of t and 
various (delta-function) P. So it is not as trivial to use historical data to estimate 
the gradient of the Lagrangian as in the canonical optimization case. More elaborate 
techniques from machine learning and statistics need to be brought to bear. 
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